Import libraries
In [1]:
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from sklearn import pipeline, preprocessing, compose, linear_model, impute, model_selection
Load data
In [2]:
df = pd.read_csv("/data/insurance.csv")
df.head()
Out[2]:
Create X (features) and y (target variable). Apply a log transformation to the target to tame its right-skewed distribution and reduce the influence of outliers.
In [3]:
target = "charges"
y = np.log10(df[target])
X = df.drop(columns=[target])
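A quick sketch (using made-up numbers, not the insurance data) of why the log10 transform helps: large outliers are compressed toward the bulk of the data on the log scale, and the transform is exactly invertible with `10 **`, which is how predictions are mapped back at the end of this notebook.

```python
import numpy as np

# Hypothetical right-skewed, charge-like values (not from the dataset)
charges = np.array([1_100.0, 4_500.0, 9_800.0, 63_000.0])

log_charges = np.log10(charges)

# On the raw scale the largest value is ~57x the smallest;
# on the log scale the spread shrinks to under 2 log units.
print(charges.max() / charges.min())
print(log_charges.max() - log_charges.min())

# The transform is exactly invertible, so predictions made on the
# log scale can be converted back with 10 **
restored = 10 ** log_charges
assert np.allclose(restored, charges)
```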
Identify the categorical columns and numeric columns. We will apply an imputer (to replace null values) and one-hot encoding to the categorical columns, and an imputer, polynomial feature expansion, and standard scaling (z-scoring) to the numeric columns.
In [4]:
cat_columns = ["gender", "smoker", "region"]
num_columns = ["age", "bmi", "children"]
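The column lists above are hard-coded. A sketch of how they could instead be inferred from the dtypes, shown here on a tiny stand-in frame with the notebook's column names (the rows are made up):

```python
import pandas as pd

# Stand-in frame with the same columns as the insurance data (made-up rows)
X = pd.DataFrame({
    "age": [19, 52], "gender": ["female", "male"], "bmi": [27.9, 30.1],
    "children": [0, 2], "smoker": ["yes", "no"],
    "region": ["southwest", "northeast"],
})

# Object/category dtypes -> categorical; everything else -> numeric
cat_columns = X.select_dtypes(include=["object", "category"]).columns.tolist()
num_columns = X.select_dtypes(exclude=["object", "category"]).columns.tolist()

print(cat_columns)  # ['gender', 'smoker', 'region']
print(num_columns)  # ['age', 'bmi', 'children']
```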
Build pipelines for the numeric and categorical variables, then search over a hyperparameter grid to tune the model.
In [5]:
# Categorical: fill missing values with a constant, then one-hot encode
cat_pipe = pipeline.Pipeline([
    ('imputer', impute.SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', preprocessing.OneHotEncoder(handle_unknown='error', drop="first"))
])
# Numeric: median-impute, expand with degree-2 polynomial features, then z-score
num_pipe = pipeline.Pipeline([
    ('imputer', impute.SimpleImputer(strategy='median')),
    ('poly', preprocessing.PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', preprocessing.StandardScaler()),
])
preprocessing_pipe = compose.ColumnTransformer([
    ("cat", cat_pipe, cat_columns),
    ("num", num_pipe, num_columns)
])
estimator_pipe = pipeline.Pipeline([
    ("preprocessing", preprocessing_pipe),
    ("est", linear_model.ElasticNet(random_state=1))
])
param_grid = {
    # Note: the alphas are drawn randomly; seed NumPy first for reproducibility
    "est__alpha": np.random.random(10) * 0.02,
    "est__l1_ratio": np.linspace(0.0001, 1, 20),
}
gsearch = model_selection.GridSearchCV(estimator_pipe, param_grid, cv=5, verbose=1, n_jobs=8)
gsearch.fit(X, y)
print(gsearch.best_score_, gsearch.best_params_)
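GridSearchCV fits one model per (alpha, l1_ratio) combination per CV fold, which is worth counting before launching the search. A sketch using `ParameterGrid` with a grid of the same shape as the one above:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "est__alpha": np.random.random(10) * 0.02,
    "est__l1_ratio": np.linspace(0.0001, 1, 20),
}

n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)      # 10 * 20 = 200 candidate settings
print(n_candidates * 5)  # 1000 fits with cv=5, plus one final refit
```

This matches the fit count that `verbose=1` reports during the search.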
Compute the fitted values. Since we did not create a train-test split manually, the predictions below cover the entire dataset (the cross-validation score above serves as the out-of-sample estimate). Plot the residuals.
In [6]:
y_pred = gsearch.predict(X)
plt.scatter(y, y_pred - y)
plt.xlabel("Actual")
plt.ylabel("Residual")
plt.title("Residual Plot")
Out[6]:
Show a few actual values vs. predicted values.
In [7]:
pd.DataFrame({"actual": y, "predict": y_pred}).sample(10)
Out[7]:
Save the model as a pickle file, so that at prediction time we can reuse the trained model without retraining it from scratch.
In [8]:
with open(r"/tmp/model.pickle", "wb") as f:
pickle.dump(gsearch, f)
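A self-contained sketch of the save/load round trip, using a simple stand-in estimator and a temporary file rather than the fitted `gsearch` and `/tmp/model.pickle` above, to confirm that a reloaded model produces identical predictions:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in model (not the notebook's fitted grid search)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# Serialize to disk and read back
path = os.path.join(tempfile.mkdtemp(), "model.pickle")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    reloaded = pickle.load(f)

# The reloaded model predicts exactly what the original does
assert np.allclose(model.predict(X), reloaded.predict(X))
```

One caveat: a pickled scikit-learn model should be unpickled with the same library versions it was saved with, and only from trusted sources, since unpickling can execute arbitrary code.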
Reload the model from disk. In a real use case, you would probably keep the following lines and their dependencies in a separate script file.
In [9]:
import pickle
import pandas as pd
with open(r"/tmp/model.pickle", "rb") as f:
est = pickle.load(f)
Create a single record with the feature values to get the estimate.
In [10]:
record = {"age": 18, "gender": "male", "bmi": 33.0, "smoker": "no", "children": 1, "region": "southeast"}
record
Out[10]:
Create a dataframe out of the record.
In [11]:
df_input = pd.DataFrame.from_dict([record])
df_input
Out[11]:
Get the prediction for df_input. Since the model was trained on log10(charges), apply 10 ** to map the prediction back to the original scale.
In [12]:
10 ** est.predict(df_input)
Out[12]: